Predicting Income Category From Socioeconomic Characteristics
Executive Summary
We built a classification model to predict an individual's income group, split by whether they are high earners (> USD 50,000) or low earners (<= USD 50,000). Using a Logistic Regression classifier, our model achieved 78% accuracy on unseen test data with an associated F1 score of 0.602. To address the class imbalance in the data, we used a balanced class-weight approach while building our model. We also sought to understand which socioeconomic characteristics play the biggest role in determining an individual's income group. Using SHAP analysis, our findings show that, of the features in our model, Marital Status, Age, and Education are the biggest drivers of a High Income prediction.
While the Logistic Regression classifier was chosen because it makes it easier to identify the socioeconomic features that drive high income, we see an opportunity to use an ensemble model such as a Random Forest classifier to improve the model's prediction metrics.
Introduction
How is an individual's income affected by other socioeconomic factors? This is the question our team set out to investigate. Socioeconomic status here is defined as a way of describing people based on their education, income, and type of job (National Cancer Institute (n.d.)). Given the diversity of backgrounds that can exist in society, we set out to understand which factors contribute most to an individual's income.
In this analysis, we use machine learning to predict whether an individual's income is above or below USD 50,000. As the government sets out massive investment in Canadian communities to improve the lives of citizens (Housing, Infrastructure and Communities Canada (2025)), we envision our analysis as a means of providing the government with insights into which investments offer the best chances of improving an individual's life. The persistent increase in income and wealth inequality presents a strong case for prudent investing to improve lives across all of Canada (Yassin, Petit, and Abraham (2024)).
Methods
Data
For our dataset, we use the Adult dataset sourced from the UC Irvine Machine Learning Repository (Becker and Kohavi (1996)). The dataset contains 14 features obtained from census data to describe an individual's attributes. The target is a categorical column comprised of a binary outcome: whether an individual earns more than USD 50,000 (>50K) or USD 50,000 or less (<=50K). The data and the descriptions of the corresponding attributes can be explored using this link
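As one way to load the raw file, a minimal sketch is shown below; the file path, the helper name `load_adult`, and the handling of `?` placeholders are our assumptions (the raw UCI file ships without a header row, so the column names come from the dataset's attribute description):

```python
import pandas as pd

# Column names taken from the UCI attribute description; the raw CSV has no header.
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

def load_adult(path: str) -> pd.DataFrame:
    """Read the raw Adult CSV, treating '?' placeholders as missing values."""
    return pd.read_csv(
        path,
        names=ADULT_COLUMNS,
        na_values="?",
        skipinitialspace=True,  # raw file pads values with a leading space
    )
```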
Exploratory Data Analysis
Prior to model fitting and feature selection, we first perform EDA to understand the distribution of our features as it relates to our target.
Below, Table 1 shows a snip of our dataset, highlighting all the columns, as well as a small portion of the data.
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
Data Validation Check
To ensure the trustworthiness and reproducibility of our analysis, we perform a strict data validation check on the loaded raw data. This validation uses the custom DataValidator class from our src/validation.py module to verify critical aspects of the dataset:

- File integrity – confirming the file exists and is in the correct format.
- Structure validation – ensuring all expected column names and data types are present.
- Data quality checks – verifying that missing values are within acceptable limits and that no rows are completely empty.
- Duplicate detection – confirming the dataset contains no duplicated observations.
- Outlier assessment – checking that extreme values in numerical columns do not distort the analysis.
- Categorical level verification – confirming that all categorical features follow the allowed levels defined in the data description.
- Target distribution check – ensuring the target/response variable follows an expected distribution.
- Correlation anomaly detection – identifying unusually high correlations between the target and numeric features, as well as across features.
If the data fails any of these checks, a DataValidationError is raised and notebook execution is halted. This prevents us from proceeding with downstream steps such as modeling and visualization using corrupted or unexpected data.
Added project root (522-group33-income-indicators) to sys.path.
Data file format (CSV) is confirmed and the file exists.
--- Starting Data Validation Checks ---
Column names and critical data types are correct.
No entirely empty observations found (i.e., no completely missing rows).
Missingness in all columns is within the 5% threshold.
No duplicate observations found.
No outliers found in numeric columns.
No anomalies found in categorical columns.
Target distribution matches expected proportions.
No anomalous correlations found between target and numeric features.
No anomalous correlations found between numeric features.
--- All core data validation checks passed successfully! ---
SUCCESS: Data passed all validation checks and is ready for analysis!
Proceeding with a validated DataFrame of shape: (39245, 16)
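The full set of checks lives in the project's DataValidator class, which we do not reproduce here. As an illustration only, a minimal sketch of three of the checks (the function names and the 5% missingness threshold mirror the description above, but the real implementation may differ):

```python
import pandas as pd

class DataValidationError(Exception):
    """Raised when the raw data fails a validation check."""

def check_schema(df: pd.DataFrame, expected_cols: list) -> None:
    # Structure validation: every expected column must be present.
    missing = set(expected_cols) - set(df.columns)
    if missing:
        raise DataValidationError(f"Missing columns: {sorted(missing)}")

def check_missingness(df: pd.DataFrame, max_frac: float = 0.05) -> None:
    # Data quality: per-column missingness must stay under the threshold.
    frac = df.isna().mean()
    bad = frac[frac > max_frac]
    if not bad.empty:
        raise DataValidationError(f"Columns over missingness threshold: {list(bad.index)}")

def check_duplicates(df: pd.DataFrame) -> None:
    # Duplicate detection: no repeated observations allowed.
    n_dupes = int(df.duplicated().sum())
    if n_dupes:
        raise DataValidationError(f"{n_dupes} duplicated rows found")
```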
Train-Test-Split: Obey the Golden Rule
Before proceeding with further EDA and visualization of the data, we split off and stash a test set in order to evaluate our model's performance on unseen data, in accordance with the principles of the Golden Rule of machine learning.
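A minimal sketch of such a split, shown on a small stand-in frame (the `test_size` and `random_state` values are illustrative assumptions, not necessarily those used in the analysis):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Adult data; the real split runs on the full adult_df.
adult_df = pd.DataFrame({
    "age": [25, 38, 53, 28, 37, 49, 52, 31],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Stratify on the target so both splits keep the class proportions,
# and fix random_state so the split is reproducible.
adult_train, adult_test = train_test_split(
    adult_df, test_size=0.3, stratify=adult_df["income"], random_state=522
)
```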
Discern Features & Strategize Missing Data
With the split complete, we review the adult_train data to understand the summary statistics of the numerical features and to investigate the presence of null values.
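The inspection at this step boils down to two pandas calls, sketched here on a small stand-in frame (the real analysis runs them on the full adult_train split to produce the output shown below):

```python
import pandas as pd

# Small stand-in for the adult_train split.
adult_train = pd.DataFrame({
    "age": [38, 53, 28],
    "workclass": ["Private", "Private", "Private"],
    "income": ["<=50K", ">50K", "<=50K"],
})

adult_train.info()                        # column dtypes and non-null counts
null_counts = adult_train.isnull().sum()  # per-column count of missing values
print(null_counts)
```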
<class 'pandas.core.frame.DataFrame'>
Index: 27471 entries, 22466 to 4241
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 27471 non-null int64
1 workclass 27471 non-null object
2 fnlwgt 27471 non-null int64
3 education 27471 non-null object
4 education-num 27471 non-null int64
5 marital-status 27471 non-null object
6 occupation 27471 non-null object
7 relationship 27471 non-null object
8 race 27471 non-null object
9 sex 27471 non-null object
10 capital-gain 27471 non-null int64
11 capital-loss 27471 non-null int64
12 hours-per-week 27471 non-null int64
13 native-country 27471 non-null object
14 income 27471 non-null object
15 income_encoded 27471 non-null int64
dtypes: int64(7), object(9)
memory usage: 3.6+ MB
age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
income 0
income_encoded 0
dtype: int64
All null values have been handled during Data Validation.
<class 'pandas.core.frame.DataFrame'>
Index: 27471 entries, 22466 to 4241
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 27471 non-null int64
1 workclass 27471 non-null object
2 fnlwgt 27471 non-null int64
3 education 27471 non-null object
4 education-num 27471 non-null object
5 marital-status 27471 non-null object
6 occupation 27471 non-null object
7 relationship 27471 non-null object
8 race 27471 non-null object
9 sex 27471 non-null object
10 capital-gain 27471 non-null int64
11 capital-loss 27471 non-null int64
12 hours-per-week 27471 non-null int64
13 native-country 27471 non-null object
14 income 27471 non-null object
15 income_encoded 27471 non-null object
dtypes: int64(5), object(11)
memory usage: 3.6+ MB
Univariate Distribution of The Quantitative Variables
Note - Visualization of the distributions below using the altair-ally Python package (Ostblom (2020)) is performed with code adapted from UBC's DSCI-573: Feature and Model Selection course. Reference documentation for the Altair Ally package can be found using this external link
We first investigate the distribution of the dataset’s quantitative variables split against the respective income brackets summarized in Figure 1. From the plots below, we pay special focus on the age distribution of the respondents. Both distributions are right-skewed. Income earners at or below USD 50,000 tend to be younger than fellow respondents earning above USD 50,000.
Of note also is the distribution of hours worked per week, with most respondents in both income brackets reporting about 40 hours per week. The fnlwgt feature is a numerical value representing the final weight of the record; it can be viewed as the number of people represented by the row. Without further detail on the methods or derivation of this value, we chose to ignore it in our analysis.
Similarly, no in-depth documentation is provided for the capital-loss and capital-gain features, but these fields may have strong predictive value for identifying higher-income earners. We therefore chose to retain this information but performed binary encoding: any value above zero is encoded as True, while a value of zero is encoded as False. This choice was motivated by the need to retain the information value of the features while smoothing out the noise due to the lack of detailed documentation in the data's repository.
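The binary encoding described above can be sketched in pandas as follows; the column names match the dataset, but the derived flag-column names are our own illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "capital-gain": [0, 15024, 0, 7688],
    "capital-loss": [0, 0, 1902, 0],
})

# Collapse the heavily zero-inflated columns to a has/has-not flag:
# True when the value is above zero, False when it is zero.
for col in ["capital-gain", "capital-loss"]:
    df[f"{col}-flag"] = df[col] > 0
```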